Progress Report on “Big Data Mining”

Author

  • John A. Keane
Abstract

Big Data consists of voluminous, high-velocity and high-variety datasets that are increasingly difficult to process using traditional methods. Data Mining is the process of discovering knowledge by analysing raw datasets. Traditional Data Mining tools, such as Weka and R, were designed for single-node sequential execution and fail to cope with modern Big Data volumes. In contrast, distributed computing frameworks such as Hadoop and Spark can scale to thousands of nodes and process large datasets efficiently, but lack robust Data Mining libraries.

This project aims to combine the extensive libraries of Weka with the power of the distributed computing frameworks Hadoop and Spark. The system aims to scale to large data volumes by partitioning big datasets and executing Weka algorithms against the partitions in parallel. Both frameworks support the MapReduce paradigm, in which Map functions process dataset partitions in parallel and Reduce functions aggregate the results. Weka learning algorithms can be enclosed in classes (wrappers) that implement the Map interface and generate models on dataset partitions in parallel. Weka Meta-Learners can be enclosed in the Reduce interface and aggregate these models into a single output.

The Weka wrappers for the first version of Hadoop, which already exist in Weka packages, were edited and compiled against the second (latest) version. A Hadoop2 cluster was built locally for testing, and the system was tested on a variety of classification tasks. The system was then installed on AWS to carry out experiments at larger scales. Preliminary results demonstrate linear scalability. The Spark framework was installed locally and tested for interoperability with Hadoop MapReduce tasks. As expected, since both systems are Java-based, Hadoop tasks can be executed on both systems, so the existing solution can be used in Spark.

The final part of the project will use this observation to implement wrappers for Weka algorithms on Spark. By taking advantage of Spark's main-memory caching mechanisms, it should be possible to greatly improve system performance.
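
The partition, Map and Reduce scheme described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the project's implementation and uses neither Weka nor Hadoop: a toy nearest-centroid learner stands in for a Weka learning algorithm, a majority vote stands in for a Weka Meta-Learner, and a thread pool stands in for cluster nodes. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts roughly equal partitions (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def train_centroids(rows):
    """Map step (assumed stand-in learner): fit a nearest-centroid model.

    Returns {label: centroid}, where a centroid is the per-feature mean
    of the rows carrying that label in this partition.
    """
    sums, counts = {}, Counter()
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict_one(model, features):
    """Classify with one partition model: the closest centroid wins."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def vote(models, features):
    """Reduce step (assumed stand-in meta-learner): majority vote."""
    ballots = Counter(predict_one(m, features) for m in models)
    return ballots.most_common(1)[0][0]


def train_ensemble(rows, n_parts=4):
    """Train one model per non-empty partition in parallel threads."""
    parts = [p for p in partition(rows, n_parts) if p]
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return list(pool.map(train_centroids, parts))


# Toy data: two well-separated clusters labelled "low" and "high".
data = [([float(x), x + 1.0], "low") for x in range(0, 10)] + \
       [([float(x), x + 1.0], "high") for x in range(100, 110)]

models = train_ensemble(data, n_parts=4)
print(vote(models, [3.0, 4.0]))      # prints "low"
print(vote(models, [104.0, 105.0]))  # prints "high"
```

Each partition model is trained independently of the others, so in this sketch adding partitions only adds more independent training tasks plus one more ballot in the vote. The aggregated output is an ensemble of partition models rather than a single model trained on the full dataset.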

Similar resources

25+ Years of Business Intelligence and Analytics Minitrack at HICSS: A Text Mining Analysis

This research project is inspired by the occasion of the 50th anniversary of the Hawaii International Conferences on Systems Sciences (HICSS). As the current co-chairs of the longest-running minitrack on Business Intelligence (BI), Business Analytics (BA) and Big Data (as it is currently known) at HICSS, we report on its 27-year history of relevant and interesting research. Our insights into th...

Large-scale correlation mining for biomolecular network discovery

Continuing advances in high-throughput mRNA probing, gene sequencing and microscopic imaging technology are producing a wealth of biomarker data on many different living organisms and conditions. Scientists hope that increasing amounts of relevant data will eventually lead to better understanding of the network of interactions between the thousands of molecules that regulate these organisms. Thu...

Data Mining for Traffic Prediction and Analysis using Big Data

Today we are living in a data-driven world. Developments in data generation, gathering and storing technology have empowered organizations to gather data sets of massive size. Data mining is a term that blends traditional data analysis methods with cultured algorithms to handle the tasks stood by these new forms of data sets. This paper is a comparative analysis of various Data Mining of traffi...

Survey on Data Mining Algorithm and Its Application in Healthcare Sector Using Hadoop Platform

In this survey paper, we have scrutinized and revealed the benefits of Hadoop in the Healthcare sector using data mining where the data flow was in massive volume. In developing countries like India with a huge population, there exist various problems in the field of healthcare with respect to the expenses met by the economically underprivileged people, access to the hospitals and research in th...

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has today become one of the concerns of data science scholar...

Publication date: 2014